Recasting data

Sometimes long data needs to be wide, and sometimes wide data needs to be long. I'll explain.

You are soon going to discover that long before you can visualize data, you need to have it in a form that the visualization library can deal with. One of the ways that isn't immediately obvious is how your data is cast. Most of the data you will encounter will be wide -- each row will represent a single entity with multiple measures for that entity. So think of states. Your dataset could have population, average life expectancy and other demographic data.

But what if your visualization library needs one row for each measure? That's where recasting your data comes in. We can use a library called reshape2 to melt or cast the data, depending on what we need.

So let's transform a dataset we've already used -- registered voters in Nebraska -- from wide data to long data and back again. First, we'll import the library and then open the data.



In [1]:

    
library(reshape2)



In [2]:

    
voters <- read.csv("../../Data/registeredvoters.csv")



In [3]:

    
head(voters)









    





County Republican10 Democrat10 Libertarian10 Nonpartisan10 Total10 Republican16 Democrat16 Nonpartisan16 Libertarian16 Total16

	Adams   10018   5536    6       2972    18532   10746   5027    3591    163     19527   
	Antelope  3005   1147    0        538     4690    3088    863     594     12      4557   
	Arthur    284     52    0         10      346     286     37      15      3       341   
	Banner    424     53    0         53      530     427     38      73      7       545   
	Blaine    314     56    0         24      394     310     43      29      2       384   
	Boone    2390   1156    0        408     3954    2469    901     404     11      3785

Making data long, in most cases, is very, very easy. It's simple. We're going to create a new data frame called longvoters, and then melt our voters data into it. Then we'll run head and you'll see each measure gets it's own row -- so each county has 10 rows of data for it.



In [4]:

    
longvoters <- melt(voters)
head(longvoters)









    



Using County as id variables






    





County variable value

	Adams       Republican10 10018       
	Antelope    Republican10  3005       
	Arthur      Republican10   284       
	Banner      Republican10   424       
	Blaine      Republican10   314       
	Boone       Republican10  2390

But one problem that isn't immediately clear is where in spreadsheet world, a header like Republican10 and Republican16 makes sense, it doesn't here. We need fields to be able to have a County, a Party, a Year and then a count of those voters. So let's so some more worth with mutate and create those columns we need. And let's start exploring programmatic text manipulations. To simplify, we're going to use a library called stringr to get predictable patterns in our data.

So if you look at the variable field, you see we have the party at the front and the year in the back and the year is always the last two characters of the field. The problem? We don't know WHICH characters those are. But stringr has a powerful tool called str_sub that can get us that, if we know one trick: That telling it to start at -2 means go to the end of the word and move backwards two characters.



In [5]:

    
library(stringr)
library(dplyr)









    



Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



In [6]:

    
longvoters %>% mutate(
    Year = str_sub(variable, start= -2),
)









    





County variable value Year

	Adams       Republican10  10018      10          
	Antelope    Republican10   3005      10          
	Arthur      Republican10    284      10          
	Banner      Republican10    424      10          
	Blaine      Republican10    314      10          
	Boone       Republican10   2390      10          
	Box Butte   Republican10   4115      10          
	Boyd        Republican10   1036      10          
	Brown       Republican10   1663      10          
	Buffalo     Republican10  15768      10          
	Burt        Republican10   2521      10          
	Butler      Republican10   3044      10          
	Cass        Republican10   8216      10          
	Cedar       Republican10   3072      10          
	Chase       Republican10   1827      10          
	Cherry      Republican10   2945      10          
	Cheyenne    Republican10   4146      10          
	Clay        Republican10   2889      10          
	Colfax      Republican10   2364      10          
	Cuming      Republican10   3639      10          
	Custer      Republican10   5450      10          
	Dakota      Republican10   3577      10          
	Dawes       Republican10   3505      10          
	Dawson      Republican10   7666      10          
	Deuel       Republican10   1027      10          
	Dixon       Republican10   1887      10          
	Dodge       Republican10  10683      10          
	Douglas     Republican10 120973      10          
	Dundy       Republican10   1013      10          
	Fillmore    Republican10   2317      10          
	⋮ ⋮ ⋮ ⋮
	Nemaha      Total16       4563      16          
	Nuckolls    Total16       3162      16          
	Otoe        Total16      10447      16          
	Pawnee      Total16       1858      16          
	Perkins     Total16       1982      16          
	Phelps      Total16       6235      16          
	Pierce      Total16       4872      16          
	Platte      Total16      20116      16          
	Polk        Total16       3516      16          
	Red Willow  Total16       7239      16          
	Richardson  Total16       5669      16          
	Rock        Total16       1024      16          
	Saline      Total16       7830      16          
	Sarpy       Total16     108473      16          
	Saunders    Total16      14696      16          
	Scotts Bluff Total16      24139      16          
	Seward      Total16      10926      16          
	Sheridan    Total16       4066      16          
	Sherman     Total16       2010      16          
	Sioux       Total16        919      16          
	Stanton     Total16       3782      16          
	Thayer      Total16       3791      16          
	Thomas      Total16        506      16          
	Thurston    Total16       4529      16          
	Valley      Total16       2995      16          
	Washington  Total16      14322      16          
	Wayne       Total16       5236      16          
	Webster     Total16       2413      16          
	Wheeler     Total16        607      16          
	York        Total16      10013      16

If I really wanted to be correct, I could make those four digit years by adding 2000 to what the str_sub finds, but I also have to turn that into a number first.



In [7]:

    
longvoters %>% mutate(
    Year = 2000 + as.integer(str_sub(variable, start= -2)),
)









    





County variable value Year

	Adams       Republican10  10018      2010        
	Antelope    Republican10   3005      2010        
	Arthur      Republican10    284      2010        
	Banner      Republican10    424      2010        
	Blaine      Republican10    314      2010        
	Boone       Republican10   2390      2010        
	Box Butte   Republican10   4115      2010        
	Boyd        Republican10   1036      2010        
	Brown       Republican10   1663      2010        
	Buffalo     Republican10  15768      2010        
	Burt        Republican10   2521      2010        
	Butler      Republican10   3044      2010        
	Cass        Republican10   8216      2010        
	Cedar       Republican10   3072      2010        
	Chase       Republican10   1827      2010        
	Cherry      Republican10   2945      2010        
	Cheyenne    Republican10   4146      2010        
	Clay        Republican10   2889      2010        
	Colfax      Republican10   2364      2010        
	Cuming      Republican10   3639      2010        
	Custer      Republican10   5450      2010        
	Dakota      Republican10   3577      2010        
	Dawes       Republican10   3505      2010        
	Dawson      Republican10   7666      2010        
	Deuel       Republican10   1027      2010        
	Dixon       Republican10   1887      2010        
	Dodge       Republican10  10683      2010        
	Douglas     Republican10 120973      2010        
	Dundy       Republican10   1013      2010        
	Fillmore    Republican10   2317      2010        
	⋮ ⋮ ⋮ ⋮
	Nemaha      Total16       4563      2016        
	Nuckolls    Total16       3162      2016        
	Otoe        Total16      10447      2016        
	Pawnee      Total16       1858      2016        
	Perkins     Total16       1982      2016        
	Phelps      Total16       6235      2016        
	Pierce      Total16       4872      2016        
	Platte      Total16      20116      2016        
	Polk        Total16       3516      2016        
	Red Willow  Total16       7239      2016        
	Richardson  Total16       5669      2016        
	Rock        Total16       1024      2016        
	Saline      Total16       7830      2016        
	Sarpy       Total16     108473      2016        
	Saunders    Total16      14696      2016        
	Scotts Bluff Total16      24139      2016        
	Seward      Total16      10926      2016        
	Sheridan    Total16       4066      2016        
	Sherman     Total16       2010      2016        
	Sioux       Total16        919      2016        
	Stanton     Total16       3782      2016        
	Thayer      Total16       3791      2016        
	Thomas      Total16        506      2016        
	Thurston    Total16       4529      2016        
	Valley      Total16       2995      2016        
	Washington  Total16      14322      2016        
	Wayne       Total16       5236      2016        
	Webster     Total16       2413      2016        
	Wheeler     Total16        607      2016        
	York        Total16      10013      2016

Okay, so I have one side of it. Now I need the other. Luckily, we can tell str_sub to start at a character and end at one. In this case, we want to start at 1 and end at -3.



In [8]:

    
longvoters %>% mutate(
    Year = 2000 + as.integer(str_sub(variable, start= -2)),
    Party = str_sub(variable, 1, -3),
)









    





County variable value Year Party

	Adams       Republican10  10018      2010        Republican  
	Antelope    Republican10   3005      2010        Republican  
	Arthur      Republican10    284      2010        Republican  
	Banner      Republican10    424      2010        Republican  
	Blaine      Republican10    314      2010        Republican  
	Boone       Republican10   2390      2010        Republican  
	Box Butte   Republican10   4115      2010        Republican  
	Boyd        Republican10   1036      2010        Republican  
	Brown       Republican10   1663      2010        Republican  
	Buffalo     Republican10  15768      2010        Republican  
	Burt        Republican10   2521      2010        Republican  
	Butler      Republican10   3044      2010        Republican  
	Cass        Republican10   8216      2010        Republican  
	Cedar       Republican10   3072      2010        Republican  
	Chase       Republican10   1827      2010        Republican  
	Cherry      Republican10   2945      2010        Republican  
	Cheyenne    Republican10   4146      2010        Republican  
	Clay        Republican10   2889      2010        Republican  
	Colfax      Republican10   2364      2010        Republican  
	Cuming      Republican10   3639      2010        Republican  
	Custer      Republican10   5450      2010        Republican  
	Dakota      Republican10   3577      2010        Republican  
	Dawes       Republican10   3505      2010        Republican  
	Dawson      Republican10   7666      2010        Republican  
	Deuel       Republican10   1027      2010        Republican  
	Dixon       Republican10   1887      2010        Republican  
	Dodge       Republican10  10683      2010        Republican  
	Douglas     Republican10 120973      2010        Republican  
	Dundy       Republican10   1013      2010        Republican  
	Fillmore    Republican10   2317      2010        Republican  
	⋮ ⋮ ⋮ ⋮ ⋮
	Nemaha      Total16       4563      2016        Total       
	Nuckolls    Total16       3162      2016        Total       
	Otoe        Total16      10447      2016        Total       
	Pawnee      Total16       1858      2016        Total       
	Perkins     Total16       1982      2016        Total       
	Phelps      Total16       6235      2016        Total       
	Pierce      Total16       4872      2016        Total       
	Platte      Total16      20116      2016        Total       
	Polk        Total16       3516      2016        Total       
	Red Willow  Total16       7239      2016        Total       
	Richardson  Total16       5669      2016        Total       
	Rock        Total16       1024      2016        Total       
	Saline      Total16       7830      2016        Total       
	Sarpy       Total16     108473      2016        Total       
	Saunders    Total16      14696      2016        Total       
	Scotts Bluff Total16      24139      2016        Total       
	Seward      Total16      10926      2016        Total       
	Sheridan    Total16       4066      2016        Total       
	Sherman     Total16       2010      2016        Total       
	Sioux       Total16        919      2016        Total       
	Stanton     Total16       3782      2016        Total       
	Thayer      Total16       3791      2016        Total       
	Thomas      Total16        506      2016        Total       
	Thurston    Total16       4529      2016        Total       
	Valley      Total16       2995      2016        Total       
	Washington  Total16      14322      2016        Total       
	Wayne       Total16       5236      2016        Total       
	Webster     Total16       2413      2016        Total       
	Wheeler     Total16        607      2016        Total       
	York        Total16      10013      2016        Total

Two last things that are bothering me about our data? I don't like the number of voters being called value and we don't need variable. So we'll use dpylr to clean this up a bit.

First we're going to use dpylr's select and a little R trick where you can use a negative sign to mean not this.

Then, we'll use rename, which is backwards to me. If you read it, what it says is rename a column to Count and assign it the values of value. The reverse makes sense to me. Make values = Count, but dplyr does it the other way. If you get an error, you probably did it backwards (I did).



In [9]:

    
longvoters %>% mutate(
    Year = 2000 + as.integer(str_sub(variable, start= -2)),
    Party = str_sub(variable, 1, -3),
) %>% select (-c(variable)) %>% rename(Count=value)









    





County Count Year Party

	Adams      10018    2010      Republican
	Antelope    3005    2010      Republican
	Arthur       284    2010      Republican
	Banner       424    2010      Republican
	Blaine       314    2010      Republican
	Boone       2390    2010      Republican
	Box Butte   4115    2010      Republican
	Boyd        1036    2010      Republican
	Brown       1663    2010      Republican
	Buffalo    15768    2010      Republican
	Burt        2521    2010      Republican
	Butler      3044    2010      Republican
	Cass        8216    2010      Republican
	Cedar       3072    2010      Republican
	Chase       1827    2010      Republican
	Cherry      2945    2010      Republican
	Cheyenne    4146    2010      Republican
	Clay        2889    2010      Republican
	Colfax      2364    2010      Republican
	Cuming      3639    2010      Republican
	Custer      5450    2010      Republican
	Dakota      3577    2010      Republican
	Dawes       3505    2010      Republican
	Dawson      7666    2010      Republican
	Deuel       1027    2010      Republican
	Dixon       1887    2010      Republican
	Dodge      10683    2010      Republican
	Douglas   120973    2010      Republican
	Dundy       1013    2010      Republican
	Fillmore    2317    2010      Republican
	⋮ ⋮ ⋮ ⋮
	Nemaha        4563      2016        Total       
	Nuckolls      3162      2016        Total       
	Otoe         10447      2016        Total       
	Pawnee        1858      2016        Total       
	Perkins       1982      2016        Total       
	Phelps        6235      2016        Total       
	Pierce        4872      2016        Total       
	Platte       20116      2016        Total       
	Polk          3516      2016        Total       
	Red Willow    7239      2016        Total       
	Richardson    5669      2016        Total       
	Rock          1024      2016        Total       
	Saline        7830      2016        Total       
	Sarpy       108473      2016        Total       
	Saunders     14696      2016        Total       
	Scotts Bluff  24139      2016        Total       
	Seward       10926      2016        Total       
	Sheridan      4066      2016        Total       
	Sherman       2010      2016        Total       
	Sioux          919      2016        Total       
	Stanton       3782      2016        Total       
	Thayer        3791      2016        Total       
	Thomas         506      2016        Total       
	Thurston      4529      2016        Total       
	Valley        2995      2016        Total       
	Washington   14322      2016        Total       
	Wayne         5236      2016        Total       
	Webster       2413      2016        Total       
	Wheeler        607      2016        Total       
	York         10013      2016        Total



In [10]:

    
newlongvoters <- longvoters %>% mutate(
    Year = 2000 + as.integer(str_sub(variable, start= -2)),
    Party = str_sub(variable, 1, -3),
) %>% select (-c(variable)) %>% rename(Count=value)

Then, we can put it back together again by casting it using dcast. With dcast, we need to tell it which variable is our main identifier -- which is County -- and what the headers should be. We also have to tell it where the numbers should come from, since we blew it all apart. And you'll see, we've changed the data substantially, but it looks almost identical to the original dataset.



In [11]:

    
widevoters <- dcast(newlongvoters, County ~ Party+Year, value.var = "Count")
head(widevoters)









    





County Democrat_2010 Democrat_2016 Libertarian_2010 Libertarian_2016 Nonpartisan_2010 Nonpartisan_2016 Republican_2010 Republican_2016 Total_2010 Total_2016

	Adams   5536    5027    6       163     2972    3591    10018   10746   18532   19527   
	Antelope 1147     863    0        12      538     594     3005    3088    4690    4557   
	Arthur    52      37    0         3       10      15      284     286     346     341   
	Banner    53      38    0         7       53      73      424     427     530     545   
	Blaine    56      43    0         2       24      29      314     310     394     384   
	Boone   1156     901    0        11      408     404     2390    2469    3954    3785

Assignment

Melt the population estimates data from assignment 3. Create long data, where each row is a single year for a single county, with columns for the state, county, year and estimate.

Rubric

Did you import the data correctly?
Did you apply melt correctly?
Did you mutate/rename columns correctly?
Did you explain your steps using Markdown comments?



In [ ]:

County	Republican10	Democrat10	Libertarian10	Nonpartisan10	Total10	Republican16	Democrat16	Nonpartisan16	Libertarian16	Total16
Adams	10018	5536	6	2972	18532	10746	5027	3591	163	19527
Antelope	3005	1147	0	538	4690	3088	863	594	12	4557
Arthur	284	52	0	10	346	286	37	15	3	341
Banner	424	53	0	53	530	427	38	73	7	545
Blaine	314	56	0	24	394	310	43	29	2	384
Boone	2390	1156	0	408	3954	2469	901	404	11	3785

County	variable	value
Adams	Republican10	10018
Antelope	Republican10	3005
Arthur	Republican10	284
Banner	Republican10	424
Blaine	Republican10	314
Boone	Republican10	2390

County	Count	Year	Party
Adams	10018	2010	Republican
Antelope	3005	2010	Republican
Arthur	284	2010	Republican
Banner	424	2010	Republican
Blaine	314	2010	Republican
Boone	2390	2010	Republican
Box Butte	4115	2010	Republican
Boyd	1036	2010	Republican
Brown	1663	2010	Republican
Buffalo	15768	2010	Republican
Burt	2521	2010	Republican
Butler	3044	2010	Republican
Cass	8216	2010	Republican
Cedar	3072	2010	Republican
Chase	1827	2010	Republican
Cherry	2945	2010	Republican
Cheyenne	4146	2010	Republican
Clay	2889	2010	Republican
Colfax	2364	2010	Republican
Cuming	3639	2010	Republican
Custer	5450	2010	Republican
Dakota	3577	2010	Republican
Dawes	3505	2010	Republican
Dawson	7666	2010	Republican
Deuel	1027	2010	Republican
Dixon	1887	2010	Republican
Dodge	10683	2010	Republican
Douglas	120973	2010	Republican
Dundy	1013	2010	Republican
Fillmore	2317	2010	Republican
⋮	⋮	⋮	⋮
Nemaha	4563	2016	Total
Nuckolls	3162	2016	Total
Otoe	10447	2016	Total
Pawnee	1858	2016	Total
Perkins	1982	2016	Total
Phelps	6235	2016	Total
Pierce	4872	2016	Total
Platte	20116	2016	Total
Polk	3516	2016	Total
Red Willow	7239	2016	Total
Richardson	5669	2016	Total
Rock	1024	2016	Total
Saline	7830	2016	Total
Sarpy	108473	2016	Total
Saunders	14696	2016	Total
Scotts Bluff	24139	2016	Total
Seward	10926	2016	Total
Sheridan	4066	2016	Total
Sherman	2010	2016	Total
Sioux	919	2016	Total
Stanton	3782	2016	Total
Thayer	3791	2016	Total
Thomas	506	2016	Total
Thurston	4529	2016	Total
Valley	2995	2016	Total
Washington	14322	2016	Total
Wayne	5236	2016	Total
Webster	2413	2016	Total
Wheeler	607	2016	Total
York	10013	2016	Total